Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

نویسندگان

  • Emanuel H. Rubensson
  • Elias Rudberg
چکیده

We present a library for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. By using a quadtree matrix representation data locality is exploited without any prior information about the matrix sparsity pattern. A distributed quadtree matrix representation is straightforward to implement due to our recent development of the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined with the Chunks and Tasks model leads to favorable weak and strong scaling of the communication cost with the number of processes, as shown both theoretically and in numerical experiments. Matrices are represented by sparse quadtrees of chunk objects. The leaves in the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by the matrix library and may occur at any level in the hierarchy and/or within the submatrix leaves. In case Graphics Processing Units (GPUs) are available, both CPUs and GPUs are used for leaf-level multiplication work, thus making use of the full computing capacity of each node. The performance of our matrix-matrix multiplication implementation is evaluated for dense and banded matrices and matrices with randomly distributed submatrix blocks, as well as matrices from linear scaling electronic structure calculations. We show that, compared to methods that do not exploit data locality, our locality-aware approach reduces communication significantly, achieving essentially constant communication per node in weak scaling tests.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Scalable parallelization of dynamic algorithms using the Chunks and Tasks programming model

We describe how the Chunks and Tasks programming model can be used for efficient parallelization of computations. In the Chunks and Tasks model there is no message passing, instead the application programmer specifies how to divide the work into small pieces (tasks) that can be executed in parallel. Abstractions for data (chunks) are also provided. The application programmer need not worry abou...

متن کامل

Chunks and Tasks: A programming model for parallelization of dynamic algorithms

We propose Chunks and Tasks, a parallel programming model built on abstractions for both data and work. The application programmer specifies how data and work can be split into smaller pieces, chunks and tasks, respectively. The Chunks and Tasks library maps the chunks and tasks to physical resources. In this way we seek to combine user friendliness with high performance. An application program...

متن کامل

Data and Computation Abstractions for Dynamic and Irregular Computations

Effective data distribution and parallelization of computations involving irregular data structures is a challenging task. We address the twin-problems in the context of computations involving block-sparse matrices. The programming model provides a global view of a distributed block-sparse matrix. Abstractions are provided for the user to express the parallel tasks in the computation. The tasks...

متن کامل

A New Parallel Matrix Multiplication Method Adapted on Fibonacci Hypercube Structure

The objective of this study was to develop a new optimal parallel algorithm for matrix multiplication which could run on a Fibonacci Hypercube structure. Most of the popular algorithms for parallel matrix multiplication can not run on Fibonacci Hypercube structure, therefore giving a method that can be run on all structures especially Fibonacci Hypercube structure is necessary for parallel matr...

متن کامل

The design and analysis of bulk-synchronous parallel algorithms

The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. This thesis presents a systematic approach to the design and analysis of BSP algorithms. We introduce an extension of the BSP model, called BSPRAM, which reconciles shared-memory style programming with e cient exploitation of data locality. The BSPRAM model can be optimally si...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Parallel Computing

دوره 57  شماره 

صفحات  -

تاریخ انتشار 2016